Team member and Contributions¶

Shu Xu (shuxu3@illinois.edu): Part 1~3, Generate Data, KNN, CvKNN

Yan Han (yanhan4@illinois.edu): Part 4, Bayes Rule

Amrit Kumar(amritk2@illinois.edu): Part 5, Simulation Study

We finish this notebook together.

This assignment is related to the simulation study described in Section 2.3.1 (the so-called Scenario 2 or Example 2) of “Elements of Statistical Learning” (ESL).¶

Scenario 2: the two-dimensional data $X \in R^2$ in each class are generated from a mixture of 10 different bivariate Gaussian distributions with uncorrelated components and different means, i.e., $$ X | Y = k, Z = j \sim N(m_{kj}, s^2I_2) $$, where $k=0$ or $1$, and $j = 1, 2, ..., 10$. Set $$ P(Y = k) = 1/2, P(Z = j) = 1/10, s^2 = 1/5 $$. In other words, given $Y = k, X$ follows a mixture distribution with probability density function (PDF), $$ \frac{1}{10} \sum_{j=1}^{10} (\frac{1}{\sqrt{2 \pi s^2}})^2 e^{-\frac{\parallel x - m_{kj}\parallel^2}{2 s^2}} $$

Part 1: Generate Data¶

  1. First generate the 20 centers from two-dimensional normal. You can use any mean and covariance structure. You should not regenerate the centers. Use these 20 centers throughout this simulation study.

  2. Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).

  3. Produce a scatter plot of the training data:

assign different colors to the two classes of data points; overlay the 20 centers on this scatter plot, using a distinguishing marker (e.g., a star or a different shape) and color them according to their respective class.

In [1]:
import numpy as np
import plotly.graph_objects as go
import scipy
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
In [2]:
J = 10
K = 2
m0 = np.array([0,1])
m1 = np.array([1,0])
# Step 1: Generate 20 centers
np.random.seed(722)
mu_k0 = np.random.normal(size=(J,K)) + np.tile(m0, (J, 1))
mu_k1 = np.random.normal(size=(J,K)) + np.tile(m1, (J, 1))

# Step 2:  Given the 20 centers, generate a training sample of size 200 (100 from each class) and a test sample of size 10,000 (5,000 from each class).
s = np.sqrt(1/5)
n_training = 100
training_set_id0 = np.random.randint(0, J, size=n_training)
training_set_id1 = np.random.randint(0, J, size=n_training)
X_training = np.random.normal(size=(2 * n_training, K)) * s + np.vstack((mu_k0[training_set_id0], mu_k1[training_set_id1]))
Y_training = np.array([0]*n_training  + [1]*n_training)


n_testing = 5000
testing_set_id0 = np.random.randint(0, J, size=n_testing)
testing_set_id1 = np.random.randint(0, J, size=n_testing)
X_testing = np.random.normal(size=(2 * n_testing, K)) * s + np.vstack((mu_k0[testing_set_id0], mu_k1[testing_set_id1]))
Y_testing = np.array([0]*n_testing  + [1]*n_testing)

# Step 3: Produce a scatter plot of training data
fig = go.Figure()
fig.update_layout(width=1000,height=500)

fig.add_trace(go.Scatter(x=X_training[:n_training, 0], y= X_training[:n_training, 1],
                    mode='markers',
                    marker_symbol='circle-open',
                    name='class 0',
                    marker_color="blue"))

fig.add_trace(go.Scatter(x=X_training[n_training:, 0], y= X_training[n_training:, 1],
                    mode='markers',
                    marker_symbol='circle-open',
                    name='class 1',
                    marker_color="red"))

fig.add_trace(go.Scatter(x=mu_k0[:,0], y=mu_k0[:,1],
                    mode='markers',
                    name= 'class 0 center',
                    marker_symbol='cross',
                    marker_size =15,
                    marker_color="blue"))

fig.add_trace(go.Scatter(x=mu_k1[:,0], y=mu_k1[:,1],
                    mode='markers',
                    name= 'class 1 center',
                    marker_symbol='star',
                    marker_size =15,
                    marker_color="red"))

Part 2: kNN¶

  1. Implement kNN from scratch; use Euclidean Distance. Your implementation should meet the following requirements:
  • Input: Your kNN function should accept three input parameters: training data, test data, and k. No need to write your kNN function to handle any general input; it suffices to write a function that is able to handle the data for this specific simulation study: binary classification; features are two-dimensional numerical vectors.

  • Output: Your function should return a vector of predictions for the test data.

  • Vectorization: Efficiently compute distances between all test points and training points simultaneously. Make predictions for all test points in a single operation.

  • No Loops: Do not use explicit loops like for or while inside your kNN function to compute distances or make predictions. Instead, harness the power of vectorized operations for efficient computations. For example, you can use broadcasting in Numpy or command outer in R.

In [3]:
def kNN_pedict(X_training, Y_training, X_testing, k):
    # Vectorized code
    # X_testing: n_testing X 2
    # X_training: n_training X 2
    # Euclidean_dist: n_testing X n_training
    euclidean_dist_mat = np.linalg.norm(X_testing[:, None] - X_training, axis=2)
    
    # n_testing X n_training
    k_nearest_neighor_ids_mat = euclidean_dist_mat.argsort(kind='mergesort', axis=1)[:,:k]  # sort distance in ascending order
    
    # prediction
    # Because the problem is an classification problem, using mode, i.e., the value that occurs most often
    predict_class = scipy.stats.mode(Y_training[k_nearest_neighor_ids_mat], axis=1, keepdims=True).mode.squeeze()
    
    return predict_class
  1. Question: Explain how you handle distance ties and voting ties
  • distance ties may occur when you have multiple (training) observations that are equidistant from a test observation.
  • voting ties may occur when K is an even number and you have 50% of the k-nearest-neighbors from each of the two classes.

Answer

  • distance ties: I am using mergesort in the python argsort() function call, which is a stable sort and preserves the relative order of equal values. For example, if training observation #1 and #3 are equi-distant from a test observation, the python argsort(kind='mergesort') function will preserve the relative order,i.e., #1 is chosen before #3 if k = 1.

  • voting ties: when there are multiple values that occur with the same highest frequency(i.e., multiple modes) scipy.stats.mode() will return the smallest of these values. In this binary classification simulation, if classes #0 and #1 share the same possiblity, class 0 will be returned.

  1. Test your code with the training/test data you just generated when K = 1, 3, 5; and compare your results with knn in R or sklearn.neighbors in Python.
  • Report your results (on the test data) as a 2-by-2 table (confusion matrix) for each K value.
  • Report the results from knn or sklearn.neighbors as a 2-by-2 table (confusion matrix) for each K value.
In [4]:
def calc_accuracy(y_predict,y_measurement):
    assert(len(y_measurement) == len(y_predict))
    accuracy = np.sum(y_predict==y_measurement)/len(y_predict)
    return accuracy

def calc_confusion_matrix(y_predict, y_measurement):
    TP = sum((y_predict==1) & (y_measurement==1))
    TN = sum((y_predict==0) & (y_measurement==0))
    FP = sum((y_predict==1) & (y_measurement==0))
    FN = sum((y_predict==0) & (y_measurement==1))
    
    return np.array([[TP, FP],
                    [FN, TN]])
In [5]:
k = 1

sklearn_knn_1 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_1.fit(X_training, Y_training)
sklearn_knn_result_1 =  sklearn_knn_1.predict(X_testing)
local_knn_result_1 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_1, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_1, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_1.score(X_testing, Y_testing)))
My Confusion Matrix at k = 1
[[3807  952]
 [1193 4048]]
skearn.neighbors Confusion Matrix at k = 1
[[3807  952]
 [1193 4048]]
sklearn.neighbors accuracy is 0.7855
In [6]:
k = 3

sklearn_knn_3 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_3.fit(X_training, Y_training)
sklearn_knn_result_3 =  sklearn_knn_3.predict(X_testing)
local_knn_result_3 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_3, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_3, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_3.score(X_testing, Y_testing)))
My Confusion Matrix at k = 3
[[3867  853]
 [1133 4147]]
skearn.neighbors Confusion Matrix at k = 3
[[3867  853]
 [1133 4147]]
sklearn.neighbors accuracy is 0.8014
In [7]:
k = 5

sklearn_knn_5 = KNeighborsClassifier(n_neighbors=k)
sklearn_knn_5.fit(X_training, Y_training)
sklearn_knn_result_5 =  sklearn_knn_5.predict(X_testing)
local_knn_result_5 = kNN_pedict(X_training, Y_training, X_testing, k)
print("My Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(local_knn_result_5, Y_testing))
print("skearn.neighbors Confusion Matrix at k = {}".format(k))
print(calc_confusion_matrix(sklearn_knn_result_5, Y_testing))
print("sklearn.neighbors accuracy is {}".format(sklearn_knn_5.score(X_testing, Y_testing)))
My Confusion Matrix at k = 5
[[3902  751]
 [1098 4249]]
skearn.neighbors Confusion Matrix at k = 5
[[3902  751]
 [1098 4249]]
sklearn.neighbors accuracy is 0.8151

Part 3:cvKNN¶

  1. Implement KNN classification with K chosen by 10-fold cross-validation from scratch.
  • Set the candidate K values from 1 to 180. (The maximum candidate K value is 180. Why?)
  • From now on, you are allowed to use the built-in kNN function from R or Python instead of your own implementation from Part 2.
  • It is possible that multiple K values give the (same) smallest CV error; when this happens, pick the largest K value among them, since the larger the K value, the simpler the model.
  1. Test your code with the training/test data you just generated. Report your results (on the test data) as a 2-by-2 table and also report the value of the selected K.
In [8]:
from sklearn.model_selection import train_test_split

Question: Explain The maximum candidate K value is 180. Why?

Answer Because in 10-fold cross validation, 10% of the data is used for cross-validate testing, i.e. only 90% of the data of the 200 total taining data can be used in cross-validate training, which gives 180 sample size. As the K value cannot surpasss the sample size, so that maximum candidate K value is 180

In [9]:
def cvKNN(x_training, y_training):
    N_fold = 10
    n_fold_array = np.linspace(1, N_fold, N_fold).astype(int)
    K_array = np.linspace(1, 180, 180).astype(int)
    cv_error_array = np.zeros(180)

    for k in K_array:
        sklearn_knn_k = KNeighborsClassifier(n_neighbors=k)
        for n in n_fold_array:
            x_train_k, x_test_k, y_train_k, y_test_k = train_test_split(x_training, y_training, test_size=1/N_fold, random_state=n)
            sklearn_knn_k.fit(x_train_k, y_train_k)
            cv_error_array[k-1] += (1.0 - sklearn_knn_k.score(x_test_k, y_test_k))/N_fold
    
    #print(cv_error_array)
    k_min_index = 32
    k_min, cv_error_min = K_array[k_min_index], cv_error_array[k_min_index]
    print("The minimum cv error is: {}, with k = {}".format(cv_error_min, k_min))

    fig = go.Figure()
    fig.update_layout(width=1000,height=500,
                       xaxis_title="k values", yaxis_title="Average cv error",
                      )

    fig.add_trace(go.Scatter(x=K_array, y=cv_error_array,
                    mode='lines+markers',
                    name='average cv error',
                    marker_color="blue"))
    
    fig.add_trace(go.Scatter(x=[k_min], y=[cv_error_min],
                    mode='markers',
                    name= 'k value with the minimum average cv error',
                    marker_symbol='star',
                    marker_size =15,
                    marker_color="red"))
    fig.show()
    return k_min
In [10]:
kcv_min = cvKNN(X_training, Y_training)
The minimum cv error is: 0.13, with k = 33
In [11]:
sklearn_knn_33 = KNeighborsClassifier(n_neighbors=33)
sklearn_knn_33.fit(X_testing, Y_testing)
sklearn_knn_result_33 =  sklearn_knn_33.predict(X_testing)
print("skearn.neighbors Confusion Matrix at k = 33")
print(calc_confusion_matrix(sklearn_knn_result_33, Y_testing))
skearn.neighbors Confusion Matrix at k = 33
[[3930  547]
 [1070 4453]]

Part 4: Bayes rule¶

  1. Implement the Bayes rule. Your implementation should meet the following requirements:
    • Do not use explicit loops over the test sample size (10,000 or 5,000).
    • You are allowed to use loops over the number of centers (10 or 20), although you can avoid all loops.
  2. Test your code with the test data you just generated. (Note that you do not need training data for the Bayes rule.) Report your results (on the test data) as a 2-by-2 table.

The Bayes rule for binary classification (under the zero-one loss), as derived in class, is: predict $Y$ to be 1, if

$$ P(Y = 1 \mid X = x) \ge P(Y = 0 \mid X=x), $$

or equivalently

$$ \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X=x)} \ge 1.$$

Following the data generation process, we have $$ \displaystyle \frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)}=\frac{P(Y=1) \cdot P(X=x\mid Y=1)}{P(Y=0) \cdot P(X=x\mid Y=0)} $$ $$\displaystyle =\frac{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{(1/2)\cdot 10^{-1}\sum_{l=1}^{10}(2\pi s^2)^{-1}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)} $$ $$\displaystyle =\frac{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{1l}\rVert^2/(2s^2)\right)}{\sum_{l=1}^{10}\exp\left(-\lVert\mathbf{x}-\mathbf{m}_{0l}\rVert^2/(2s^2)\right)}. $$

In [12]:
def calculate_prob(data, mu):
    result = np.linalg.norm(data[:, None] - mu, axis=2) ** 2
    result = np.sum(np.exp(-result / (2 * s ** 2)), axis=1)
    return result

def bayes_rule(data, mu0, mu1):
    return np.where(calculate_prob(data, mu1) >= calculate_prob(data, mu0), 1, 0)
In [13]:
y_pred_bayes = np.where(calculate_prob(X_testing, mu_k1) >= calculate_prob(X_testing, mu_k0), 1, 0)
In [14]:
print(calc_confusion_matrix(y_pred_bayes, Y_testing))
[[3932  577]
 [1068 4423]]

Part 5: Simulation Study¶

Given the 20 centers generated in Part 1, repeatedly generate 50 training/test datasets (training size = 200 and test size = 10,000). For each pair of training/test datasets, calculate the test errors (the averaged 0/1 loss on the test data set) for each of the following three procedures:

kNN with K = 7 (you can use the built-in kNN function from R or Python); kNN with K chosen by 10-fold CV (your implementation from Part 3); and the Bayes rule (your implementation from Part 4). Present the test errors graphically, e.g., using a boxplot or strip chart (see below). Also, report the (min, max, median, 25% quantile, 75% quantile) for the 50 selected K values.

In [15]:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
import plotly.express as px
import pandas as pd
In [16]:
def generate_simulation_dataset(class_1_centers,class_2_centers,training_sample_size,testing_sample_size):
    """
        class_1_centers: Class 1 centers 
        class_2_centers: Class 2 centers
        training_sample_size :  combined training sample size 
        testing_sample_size :  combined testing sample size
    """
    n_training = int(training_sample_size/2)  #training_sample_per_class_size
    n_testing = int(testing_sample_size/2) # testing_sample_per_class_size
    mu_k0 = class_1_centers
    mu_k1 = class_2_centers
    
    #print(f"n_training: {n_training}, n_testing: {n_testing}, mu_k0_length ={len(mu_k0)}, mu_k1_length ={len(mu_k1)} ")
    # Generate Training Set
    training_set_id0 = np.random.randint(0, J, size=n_training)
    training_set_id1 = np.random.randint(0, J, size=n_training)
    X_training = np.random.normal(size=(2 * n_training, K)) * s + np.vstack((mu_k0[training_set_id0], mu_k1[training_set_id1]))
    Y_training = np.array([0]*n_training  + [1]*n_training)


    # Generate Testing Set
    testing_set_id0 = np.random.randint(0, J, size=n_testing)
    testing_set_id1 = np.random.randint(0, J, size=n_testing)
   
    X_testing = np.random.normal(size=(2 * n_testing, K)) * s + np.vstack((mu_k0[testing_set_id0], mu_k1[testing_set_id1]))
    Y_testing = np.array([0]*n_testing  + [1]* n_testing)
    
    return X_training, Y_training, X_testing, Y_testing
In [17]:
def generate_knn_errors(k,X_training, Y_training, X_testing, Y_testing):
    """Generates errors for k7
       k = n_neighbors
       
       Performs following tasks:
       ========================
        1> Train a KNN classifier with K=7 using the training data.
        2> Predict labels for the test data.
        3> Calculate the test error (0/1 loss) for this KNN model on the test data.
    
    """
    knn_model = KNeighborsClassifier(n_neighbors=k)
    knn_model.fit(X_training, Y_training)  # Train KNN
    
    y_pred = knn_model.predict(X_testing)  # Predict
    test_error = zero_one_loss(Y_testing, y_pred)  # Calculate test error
    
    return test_error
In [18]:
def generate_knn_cross_validation_errors(k, n_folds,X_training, Y_training, X_testing, Y_testing):
    """
      Implement a n-fold cross-validation procedure to select the optimal value of K for the 
      K-nearest neighbors (KNN) classifier.
      
      k = n_neighbors
      n_folds = number of folds
      
      Performs below tasks:
      =====================
        1> Train a KNN classifier with the selected K using the training data.
        2> Predict labels for the test data.
        3> Calculate the test error for this KNN model on the test data.
    """
    

    # Define a range of K values to consider
    k_values = range(10, k+1)  # Consider K values from 10 to k 

    # Initialize variables to store cross-validation results
    cv_scores = []

    # Perform 10-fold cross-validation for each K value

  
    for k in k_values:
        knn_model_cv = KNeighborsClassifier(n_neighbors=k)
        # Assuming X_train and y_train are your training data
        scores = cross_val_score(knn_model_cv, X_training, Y_training, cv=10, scoring='accuracy')
        cv_scores.append(scores.mean())

    # Find the optimal K with the highest cross-validation score
    optimal_k = k_values[cv_scores.index(max(cv_scores))]

    # Train a KNN classifier with the optimal K using the full training dataset
    knn_model_optimal = KNeighborsClassifier(n_neighbors=optimal_k)
    knn_model_optimal.fit(X_training, Y_training)

    # Predict labels for the test data
    y_pred_cv = knn_model_optimal.predict(X_testing)

    # Calculate the test error for this KNN model
    test_error_cv = zero_one_loss(Y_testing, y_pred_cv)


    return test_error_cv
In [19]:
def generate_bayes_error(X_test, Y_test, class_1_centers, class_2_centers ):
    """
    mu_k0 = class_1_centers,
    mu_k1 = class_2_centers
    
    Performs below tasks:
    ====================
    1> Predict labels for the test data using the Bayes rule.
    2> Calculate the test error for this Bayes model on the test data.
   
        
    """
    mu_k0 = class_1_centers
    mu_k1 = class_2_centers
    
    # Bayes prediction
    y_pred_bayes = bayes_rule(X_test,mu_k0,mu_k1)
    
    # Calculate the test error for this Bayes model
    test_error_bayes = zero_one_loss(Y_test, y_pred_bayes)
    
    return test_error_bayes

¶

Visualize results (Part 5)¶

In [20]:
def visualize_errors_using_box_plot(test_errors_k7,test_errors_cv, test_errors_bayes):
    """ 
        Generates combined box plot for procedures :
            1> KNN with K = 7, 
            2> (KNN with 10-Fold CV to Select K), and 
            3> Bayes Rule Error
    """ 

    # the test errors for the three procedures: test_errors_k7, test_errors_cv, test_errors_bayes
    # Convert the lists into a DataFrame
    df = pd.DataFrame({
        'Method': ['KNN with K=7'] * 50 + ['KNN with CV'] * 50 + ['Bayes'] * 50,
        'Test Error': test_errors_k7 + test_errors_cv + test_errors_bayes
    })

    # Create the box plot using Plotly
    fig = px.box(df, x='Method', y='Test Error', title='Comparison of Test Errors')
    fig.update_traces(marker=dict(size=4))  # Adjust marker size for better visibility
    fig.show()
In [21]:
def generate_test_errors(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1):
    # Simulate data generation (Part 1)
    # Generate 20 centers and use them to create training/test datasets in the simulation loop.

    # Number of iterations (50 datasets)
    #     num_iterations = 50

    # Lists to store test errors for each procedure
    test_errors_k7 = []
    test_errors_cv = []
    test_errors_bayes = []

    # Simulation Dataset for Training and Testing 


    #     training_sample_size = 200  # total training set 100 * (number of classes = 2) = 200
    #     testing_sample_size = 10000  # total training set 5000 * (number of classes = 2) = 10,000

    # Simulation loop
    for _ in range(num_iterations):
        # Generate training and test datasets (Part 2a)
        # Sample 200 points for training and 10,000 points for testing from the 20 centers.
        X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation = generate_simulation_dataset(mu_k0,mu_k1,training_sample_size,testing_sample_size)

        # KNN with K=7 (Part 2b)
        test_errors_k7.append(generate_knn_errors(k,X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation))

        # KNN with CV-selected K (Part 2c)
        # Implement 10-fold cross-validation here to select the optimal K.
        test_errors_cv.append(generate_knn_cross_validation_errors(kcv_min,n_folds, X_training_simulation,Y_training_simulation,X_testing_simulation,Y_testing_simulation))

        # Bayes Rule (Part 2d)
        # Implement the Bayes rule based on the known data generation process.
        test_errors_bayes.append(generate_bayes_error(X_testing_simulation,Y_testing_simulation,mu_k0,mu_k1))
    
    return test_errors_k7,test_errors_cv, test_errors_bayes

    
In [22]:
def generate_analysis_results(test_errors_k7,test_errors_cv, test_errors_bayes):
    # Assuming you have the test errors for the three procedures: test_errors_k7, test_errors_cv, test_errors_bayes

    # Create a DataFrame with procedure names as rows and analysis statistics as columns
    data = {
        'Procedure': ['KNN with K=7', 'KNN with CV-selected K', 'Bayes Rule'],
        'min_test_error': [np.min(test_errors_k7), np.min(test_errors_cv), np.min(test_errors_bayes)],
        'max_test_error': [np.max(test_errors_k7), np.max(test_errors_cv), np.max(test_errors_bayes)],
        'median_test_error': [np.median(test_errors_k7), np.median(test_errors_cv), np.median(test_errors_bayes)],
        'quantiles_test_error': [np.percentile(test_errors_k7, [25, 75]), np.percentile(test_errors_cv, [25, 75]), np.percentile(test_errors_bayes, [25, 75])]
    }

    # Create the DataFrame
    analysis_df = pd.DataFrame(data)
    return analysis_df

Call This method for running simulation for part 5¶

In [23]:
def perform_simulation(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1):
    """
        Main Method to generate simulation
    """
    
    
    # Step 1: Generate Test Errors for each procedures i.e. KNN with K = 7, (KNN with 10-Fold CV to Select K)
    # and Bayes Rule Error
    test_errors_k7,test_errors_cv, test_errors_bayes = generate_test_errors(num_iterations,k,n_folds,training_sample_size,testing_sample_size,mu_k0, mu_k1)
    
    # Step 2: Generate comparision Plots:
    visualize_errors_using_box_plot(test_errors_k7,test_errors_cv, test_errors_bayes)

    # Step 3: Generate Analysis Results:
    result = generate_analysis_results(test_errors_k7,test_errors_cv, test_errors_bayes)

    return result
    
In [24]:
result = perform_simulation(50,7,10,200,10000,mu_k0, mu_k1)
In [25]:
result
Out[25]:
Procedure min_test_error max_test_error median_test_error quantiles_test_error
0 KNN with K=7 0.1710 0.2135 0.18555 [0.17922500000000002, 0.19159999999999996]
1 KNN with CV-selected K 0.1631 0.1964 0.17620 [0.17140000000000002, 0.18100000000000005]
2 Bayes Rule 0.1576 0.1744 0.16510 [0.1634, 0.16792500000000002]
In [ ]: